
    Replica-molded electro-optic polymer Mach–Zehnder modulator

    A Mach–Zehnder electro-optic polymer amplitude modulator is fabricated by a simple, high-throughput soft-stamp replica-molding technique. The modulator structure incorporates the highly nonlinear and stable chromophore AJL8, doped in amorphous polycarbonate. Single-arm phase retardation yields a half-wave voltage (Vπ) of 8.4 V at 1600 nm. The on/off extinction ratio is better than 19 dB, owing to precise Y-branch power splitters and good waveguide uniformity. These results indicate that the simple fabrication process produces high-fidelity replicas of the original master devices with good optical performance.
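
    For context, these figures can be read through the standard Mach–Zehnder relations (textbook forms, not taken from the paper; d, n, Γ, and L are the usual symbols for electrode gap, refractive index, modal overlap factor, and electrode length):

    \[
      I_{\mathrm{out}} = \frac{I_{\mathrm{in}}}{2}\left(1 + \cos\Delta\phi\right),
      \qquad
      \Delta\phi = \pi\,\frac{V}{V_\pi},
      \qquad
      V_\pi = \frac{\lambda\, d}{n^{3}\, r_{33}\,\Gamma\, L}
    \]

    At V = Vπ the interferometer swings from maximum to minimum transmission, so the 8.4 V figure together with the >19 dB extinction ratio characterizes the full on/off behavior.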

    Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?

    Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences. However, in real-world scenarios, online videos are often accompanied by relevant text information such as titles, tags, and even subtitles, which can be utilized to match textual queries. This insight has motivated us to propose a novel approach to text-video retrieval, where we directly generate associated captions from videos using zero-shot video captioning with knowledge from web-scale pre-trained models (e.g., CLIP and GPT-2). Given the generated captions, a natural question arises: what benefits do they bring to text-video retrieval? To answer this, we introduce Cap4Video, a new framework that leverages captions in three ways: i) Input data: video-caption pairs can augment the training data. ii) Intermediate feature interaction: we perform cross-modal feature interaction between the video and caption to produce enhanced video representations. iii) Output score: the Query-Caption matching branch can complement the original Query-Video matching branch for text-video retrieval. We conduct comprehensive ablation studies to demonstrate the effectiveness of our approach. Without any post-processing, Cap4Video achieves state-of-the-art performance on four standard text-video retrieval benchmarks: MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%). The code is available at https://github.com/whwu95/Cap4Video. (Comment: Accepted by CVPR 2023; selected as a Highlight, top 2.5% of all submissions.)
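
    As a rough illustration of the output-score fusion in (iii), the sketch below combines Query-Video and Query-Caption similarities. The fusion weight alpha and all embeddings are placeholders for illustration, not values or code from the paper:

    # Hypothetical sketch of Cap4Video-style output-score fusion.
    import numpy as np

    def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        """Pairwise cosine similarity between rows of a and rows of b."""
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T

    rng = np.random.default_rng(0)
    num_queries, num_videos, dim = 4, 10, 512

    query_emb = rng.normal(size=(num_queries, dim))    # text-query embeddings
    video_emb = rng.normal(size=(num_videos, dim))     # video embeddings
    caption_emb = rng.normal(size=(num_videos, dim))   # generated-caption embeddings

    alpha = 0.8  # hypothetical weight for the Query-Video branch
    score = (alpha * cosine_sim(query_emb, video_emb)
             + (1 - alpha) * cosine_sim(query_emb, caption_emb))

    ranking = np.argsort(-score, axis=1)  # per-query video ranking
    print(ranking[0])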

    Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

    Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on various visual tasks. Transferring knowledge from such powerful VLMs is a promising direction for building effective video recognition models. However, current exploration in this field is still limited. We believe that the greatest value of pre-trained VLMs lies in building a bridge between the visual and textual domains. In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition. ii) We also present a Temporal Concept Spotting mechanism that uses Text-to-Video expertise to capture temporal saliency in a parameter-free manner, leading to enhanced video representations. Extensive studies on six popular video datasets, including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet, and Charades, show that our method achieves state-of-the-art performance in various recognition scenarios, such as general, zero-shot, and few-shot video recognition. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model. The code is available at https://github.com/whwu95/BIKE. (Comment: Accepted by CVPR 2023.)
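
    A minimal sketch of the parameter-free temporal-saliency idea behind the Temporal Concept Spotting mechanism, assuming CLIP-style normalized features; the temperature tau and all shapes are illustrative assumptions, not taken from the paper:

    # Hypothetical sketch: weight frames by their similarity to a category text embedding.
    import numpy as np

    def softmax(x: np.ndarray) -> np.ndarray:
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    num_frames, dim = 8, 512

    frame_feats = rng.normal(size=(num_frames, dim))   # per-frame visual features
    text_feat = rng.normal(size=(dim,))                # category text embedding

    # Normalize, score each frame against the text concept, then aggregate.
    frame_feats /= np.linalg.norm(frame_feats, axis=-1, keepdims=True)
    text_feat /= np.linalg.norm(text_feat)

    tau = 0.07  # hypothetical temperature
    saliency = softmax(frame_feats @ text_feat / tau)  # (num_frames,) weights
    video_feat = saliency @ frame_feats                # saliency-weighted pooling

    print(saliency.round(3), video_feat.shape)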

    Broadband energy-efficient optical modulation by hybrid integration of silicon nanophotonics and organic electro-optic polymer

    Silicon-organic hybrid integrated devices have emerging applications ranging from high-speed optical interconnects to photonic electromagnetic-field sensors. Silicon slot photonic crystal waveguides (PCWs) filled with electro-optic (EO) polymers combine the slow-light effect in PCWs with the high polarizability of EO polymers, which promises the realization of high-performance optical modulators. In this paper, a broadband, power-efficient, low-dispersion, and compact optical modulator based on an EO polymer filled silicon slot PCW is presented. A small voltage-length product of Vπ·L = 0.282 V·mm is achieved, corresponding to a record-high effective in-device EO coefficient (r33) of 1230 pm/V. Assisted by a backside gate voltage, a modulation response up to 50 GHz is observed, with a 3-dB bandwidth of 15 GHz, and the estimated energy consumption is 94.4 fJ/bit at 10 Gbit/s. Furthermore, lattice-shifted PCWs are utilized to enhance the optical bandwidth by a factor of ~10× over other modulators based on non-band-engineered PCWs and ring resonators. (Comment: 12 pages, 4 figures, SPIE Photonics West Conference 201)
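
    Two standard figure-of-merit relations (textbook forms, not quoted in the abstract; s is the slot width, Γ the field confinement factor, C the device capacitance, and Vpp the peak-to-peak drive voltage) connect these numbers:

    \[
      V_\pi L = \frac{\lambda\, s}{n^{3}\, r_{33}\,\Gamma},
      \qquad
      E_{\mathrm{bit}} \approx \frac{C\, V_{pp}^{2}}{4}
    \]

    A small Vπ·L therefore follows directly from the large in-device r33, and the low drive voltage it permits is what keeps the switching energy in the fJ/bit range.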

    PSDiff: Diffusion Model for Person Search with Iterative and Collaborative Refinement

    Dominant person search methods aim to localize and recognize query persons in a unified network that jointly optimizes two sub-tasks, i.e., detection and Re-IDentification (ReID). Despite significant progress, two major challenges remain: 1) detection-prior modules in previous methods are suboptimal for the ReID task; 2) the collaboration between the two sub-tasks is ignored. To alleviate these issues, we present PSDiff, a novel person search framework based on the diffusion model. PSDiff formulates person search as a dual denoising process from noisy boxes and ReID embeddings to ground truths. Unlike existing methods that follow the detection-to-ReID paradigm, our denoising paradigm eliminates detection-prior modules to avoid the local optimum of the ReID task. Following the new paradigm, we further design a new Collaborative Denoising Layer (CDL) to optimize the detection and ReID sub-tasks in an iterative and collaborative way, making the two sub-tasks mutually beneficial. Extensive experiments on standard benchmarks show that PSDiff achieves state-of-the-art performance with fewer parameters and elastic computing overhead.
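
    The inference loop can be sketched as follows; the cdl_denoise stand-in only mimics the data flow of the paper's Collaborative Denoising Layer (a trained model would predict residuals from image features), and all names and shapes are assumptions for illustration:

    # Structural sketch of PSDiff-style joint refinement of boxes and ReID embeddings.
    import numpy as np

    rng = np.random.default_rng(0)
    num_proposals, emb_dim, steps = 16, 128, 4

    def cdl_denoise(boxes, embeds, image_feats):
        """Stand-in CDL: one joint refinement step. A trained model would
        predict residuals from image features; here we simply shrink toward
        the features to show the shared, collaborative data flow."""
        boxes = 0.9 * boxes + 0.1 * image_feats[:4].mean()
        embeds = 0.9 * embeds + 0.1 * image_feats[:emb_dim]
        return boxes, embeds

    image_feats = rng.normal(size=(1024,))              # pooled image features
    boxes = rng.normal(size=(num_proposals, 4))         # noisy (cx, cy, w, h)
    embeds = rng.normal(size=(num_proposals, emb_dim))  # noisy ReID embeddings

    for _ in range(steps):                              # iterative denoising
        boxes, embeds = cdl_denoise(boxes, embeds, image_feats)

    print(boxes.shape, embeds.shape)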

    Uformer: A Unet based dilated complex & real dual-path conformer network for simultaneous speech enhancement and dereverberation

    The complex spectrum and the magnitude are considered two major features for speech enhancement and dereverberation. Traditional approaches always treat these two features separately, ignoring their underlying relationship. In this paper, we propose Uformer, a Unet based dilated complex & real dual-path conformer network operating in both the complex and magnitude domains for simultaneous speech enhancement and dereverberation. We exploit time attention (TA) and dilated convolution (DC) to leverage local and global contextual information, and frequency attention (FA) to model dimensional information. These three sub-modules, contained in the proposed dilated complex & real dual-path conformer module, effectively improve speech enhancement and dereverberation performance. Furthermore, a hybrid encoder and decoder are adopted to simultaneously model the complex spectrum and magnitude and to promote information interaction between the two domains. Encoder-decoder attention is also applied to enhance the interaction between the encoder and decoder. Uformer outperforms all state-of-the-art time-domain and complex-domain models both objectively and subjectively. Specifically, Uformer reaches 3.6032 DNSMOS on the blind test set of the Interspeech 2021 DNS Challenge, outperforming all top-performing models. We also carry out ablation experiments to assess the contribution of each proposed sub-module. (Comment: Accepted by ICASSP 2022.)
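
    A minimal sketch of the dual-domain (complex plus magnitude) masking idea that Uformer builds on; the masks are random placeholders standing in for network outputs, and the STFT parameters are illustrative assumptions:

    # Hypothetical sketch: enhance magnitude with a real mask, then refine in the complex domain.
    import numpy as np
    from scipy.signal import stft, istft

    rng = np.random.default_rng(0)
    fs = 16000
    noisy = rng.normal(size=fs)  # 1 s of stand-in noisy speech

    f, t, spec = stft(noisy, fs=fs, nperseg=512)  # complex spectrum
    mag, phase = np.abs(spec), np.angle(spec)

    # Placeholder network outputs: a real-valued magnitude mask and a
    # complex residual refinement (a trained Uformer would predict both).
    mag_mask = np.clip(rng.uniform(size=mag.shape), 0.0, 1.0)
    complex_residual = 0.01 * (rng.normal(size=spec.shape)
                               + 1j * rng.normal(size=spec.shape))

    # Magnitude branch reuses the noisy phase; the complex branch refines both.
    enhanced = (mag * mag_mask) * np.exp(1j * phase) + complex_residual
    _, out = istft(enhanced, fs=fs, nperseg=512)
    print(out.shape)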

    SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation

    Despite significant progress in Text-to-Image (T2I) generative models, even lengthy and complex text descriptions still struggle to convey detailed controls. In contrast, Layout-to-Image (L2I) generation, which aims to generate realistic and complex scene images from user-specified layouts, has risen to prominence. However, existing methods transform layout information into tokens or RGB images for conditional control in the generative process, leading to insufficient spatial and semantic controllability of individual instances. To address these limitations, we propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance. Owing to the rich spatial and semantic information encapsulated in well-designed feature maps, SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works. Additionally, we propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms. The former aims to model the relationships among multiple objects within scenes, while the latter is designed to heighten the model's sensitivity to the spatial information embedded in the guidance. Extensive experiments demonstrate that SSMG achieves highly promising results, setting a new state-of-the-art across a range of metrics encompassing fidelity, diversity, and controllability.
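
    A minimal sketch of turning a box layout into a spatial-semantic feature map of the kind SSMG conditions on, instead of tokens or an RGB rendering; the class embeddings, map size, and layout format are illustrative assumptions:

    # Hypothetical sketch: rasterize a box layout into a semantic feature map.
    import numpy as np

    rng = np.random.default_rng(0)
    H, W, emb_dim = 64, 64, 16
    num_classes = 10
    class_emb = rng.normal(size=(num_classes, emb_dim))  # stand-in semantics

    # Layout: (class_id, x0, y0, x1, y1) in normalized coordinates.
    layout = [(3, 0.1, 0.2, 0.5, 0.9), (7, 0.4, 0.1, 0.9, 0.6)]

    feat_map = np.zeros((H, W, emb_dim))
    for cls, x0, y0, x1, y1 in layout:
        r0, r1 = int(y0 * H), int(y1 * H)
        c0, c1 = int(x0 * W), int(x1 * W)
        feat_map[r0:r1, c0:c1] += class_emb[cls]  # paint semantics into the box

    print(feat_map.shape)  # (64, 64, 16) map for conditioning the diffusion model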

    Short hybrid polymer/sol-gel silica waveguide switches with high in-device electro-optic coefficient based on photostable chromophore

    The highest in-device electro-optic (EO) coefficient to date is achieved in short polymeric directional coupler switches based on hybrid EO polymer/sol-gel silica waveguides. Optimized poling conditions in such waveguides yield an in-device EO coefficient of 160 pm/V at 1550 nm using the highly efficient and photostable guest–host EO polymer SEO100. Adiabatic waveguide transitions from the passive sol-gel core to active EO polymer cores surrounding the sol-gel core are demonstrated using EO polymer cores with a coplanar tapered structure. Switching voltages of 8.4 and 10.5 V are achieved for electrodes that are 2.1 and 1.5 mm long, respectively, roughly half those of EO switches containing the chromophore AJLS102.
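
    As a quick consistency check (not a figure reported in the abstract), the two operating points give nearly the same voltage-length product, as expected when the in-device r33 is fixed and the switching voltage scales as Vs ∝ 1/(n³ r33 Γ L):

    \[
      8.4\,\mathrm{V} \times 2.1\,\mathrm{mm} \approx 17.6\,\mathrm{V\,mm},
      \qquad
      10.5\,\mathrm{V} \times 1.5\,\mathrm{mm} \approx 15.8\,\mathrm{V\,mm}
    \]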